HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 #14624
I would be happy to get on a call with you to discuss AMD hardware support; my email address can be found on my GitHub page.
@deepsek Thanks for the contribution and for reaching out. On topics related to the CUDA backend, @JohannesGaessler is the best person to consult with. For additional backends, @slaren can provide guidelines and advice. I'll be happy to provide input on any matters as well. I am also available for a call - feel free to contact me.
Very nice to see the initiative. I assume improvements made for CDNA will also carry over to the consumer side next year when UDNA releases. So this is exciting news for the future of AMD products!
This certainly is good news.
Sorry, I wanted to ask: @IMbackK since you've been working on AMD support, are you interested in joining the discussion?
Yes, certainly. It would help to avoid duplication of effort. I can be reached via email at uvos.xyz, user carl.
Added matrix core support (MFMA instructions) for MMQ kernels.
Enabled stream-K on CDNA3 so that it works with MMQ kernels.
Removed usage of the hardcoded WARP_SIZE constant in MMQ kernels.
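For readers unfamiliar with stream-K: the idea is to split the k-iterations of all output tiles evenly across a fixed number of blocks instead of assigning whole tiles to blocks, so a block may finish one tile partway through and continue into the next. Below is a minimal host-side C++ sketch of that partitioning only. `WorkSlice` and `stream_k_slices` are hypothetical names chosen for illustration and are not taken from this PR; a real kernel additionally has to combine the partial results of tiles that are shared between blocks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous piece of work: a slice of the k-loop of a single output tile.
struct WorkSlice {
    int64_t tile;     // index of the output tile
    int64_t k_begin;  // first k-iteration within the tile (inclusive)
    int64_t k_end;    // last k-iteration within the tile (exclusive)
};

// Hypothetical helper (illustration only): returns the slices that `block` processes
// when ntiles * iters_per_tile k-iterations are split evenly across nblocks blocks.
std::vector<WorkSlice> stream_k_slices(int64_t ntiles, int64_t iters_per_tile,
                                       int64_t nblocks, int64_t block) {
    assert(nblocks > 0 && block >= 0 && block < nblocks);

    const int64_t total = ntiles * iters_per_tile;
    // Contiguous range [begin, end) in the flattened (tile, k) iteration space.
    int64_t       begin = total * block / nblocks;
    const int64_t end   = total * (block + 1) / nblocks;

    std::vector<WorkSlice> slices;
    while (begin < end) {
        const int64_t tile    = begin / iters_per_tile;
        const int64_t k_begin = begin % iters_per_tile;
        // Stop at the end of the current tile or at the end of this block's range.
        const int64_t k_end   = std::min(iters_per_tile, k_begin + (end - begin));
        slices.push_back({tile, k_begin, k_end});
        begin += k_end - k_begin;
    }
    return slices;
}
```

With 3 tiles of 4 k-iterations each and 2 blocks, block 0 gets all of tile 0 plus the first half of tile 1, and block 1 gets the second half of tile 1 plus all of tile 2. Tile 1 is therefore the one whose partial sums would need to be combined afterwards.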
NOTE: Any thoughts on removing all uses of hardcoded constants that are specific to NVIDIA (like WARP_SIZE) in order to support other GPUs?
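To illustrate the kind of change the note is asking about, here is a rough sketch in which the warp width is a template parameter instead of a fixed constant, so the same code covers 32-wide NVIDIA warps and 64-wide CDNA wavefronts. This is a host-side C++ emulation so that it runs without a GPU; `warp_reduce_sum` as written here is a hypothetical stand-in and not code from this PR. On a device, the lane exchange would be done with a shuffle intrinsic rather than an array lookup.

```cpp
#include <array>
#include <cstddef>

// Hypothetical host-side emulation of a warp-wide butterfly sum reduction.
// Each array element stands in for the register of one lane.
template <std::size_t warp_size>
std::array<float, warp_size> warp_reduce_sum(std::array<float, warp_size> lanes) {
    static_assert(warp_size > 0 && (warp_size & (warp_size - 1)) == 0,
                  "warp size must be a power of two");

    // The loop bound is derived from warp_size rather than from a hardcoded 32.
    for (std::size_t offset = warp_size / 2; offset > 0; offset /= 2) {
        std::array<float, warp_size> next = lanes;
        for (std::size_t lane = 0; lane < warp_size; ++lane) {
            // Emulates an XOR shuffle: every lane adds the value of its partner lane.
            next[lane] = lanes[lane] + lanes[lane ^ offset];
        }
        lanes = next;
    }
    return lanes;  // every lane now holds the sum over all lanes
}
```

Instantiating this as `warp_reduce_sum<32>` or `warp_reduce_sum<64>` changes only the template argument, which is the property a kernel loses when the width is baked in as a macro.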
@JohannesGaessler @ggerganov
P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss some future PR plans for additional backends, flash attention changes, etc.
EDIT:
Updated to add some performance charts for the DeepSeekV3 model.
Upstream vs ROCm Fork Development

MI300X vs H100 Throughput Test
